Finite-temperature local protein sequence alignment: 
percolation and free-energy distribution 
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Sequence alignment is a tool in bioinformatics that is used to find homological relationships in 
large molecular databases. It can be mapped on the physical model of directed polymers in random 
media. We consider the finite-temperature version of local sequence alignment for proteins and 
study the transition between the linear phase and the biologically relevant logarithmic phase, where 
the free-energy grows linearly or logarithmically with the sequence length. By means of numerical 
simulations and finite-size scaling analysis we determine the phase diagram in the plane that is 
spanned by the gap costs and the temperature. We use the most frequently used parameter set for 
protein alignment. The critical exponents that describe the parameter driven transition are found 
to be explicitly temperature dependent. 

Furthermore, we study the shape of the (free-) energy distribution close to the transition by rare- 
event simulations down to probabilities of the order 10 -64 . It is well known that, in the logarithmic 
region, the optimal score distribution (T — 0) is described by a modified Gumbel distribution. We 
confirm that this also applies for the free-energy distribution (T > 0). However, in the linear phase, 
the distribution crosses over to a modified Gaussian distribution. 

PACS numbers: 87.15.Qt,87.14.E- ,05.70. Jk 



I. INTRODUCTION 

Biological sequence analysis is an interdisciplinary sci- 
entific field which uses concepts from statistics, computer 
science and molecular biology. Some approaches used in 
the context of sequence analysis are, from a conceptional 
point of view, related to models in statistical mechan- 
ics of disordered systems. One of the most fundamental 
tools in the area of sequence analysis is sequence align- 
ment (see for example [l|, [|J]). It is used to quantify sim- 
ilarities between two (or more) biological sequences, like 
DNA, proteins or RNA. Modern search tools for large 
databases, like BLAST 0] or FASTA Q, heavily rely on 
sequence alignment algorithms. 

In this article we consider algorithms for pairwise lo- 
cal protein alignment which aims at finding "conserved" 
regions of two input protein sequences. The most promi- 
nent example is the Smith- Waterman algorithm [5j. The 
algorithm finds optimal alignments (OA) according to an 
objective function. Each alignment is assigned a score 
which is maximal for optimal alignments. The optimal 
alignment score serves as a scalar measure of similarity of 
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the input sequences. Since alignments have a geometrical 
interpretation as directed paths the problem of find- 
ing an optimal alignment is directly related to the ground 
state of directed paths in random media @, H, OH ITlj 
(DPRM) in 1 + 1 dimensions. From this point of view, the 
alignment score corresponds to the negative energy and 
the optimal alignment to the ground state of the system. 

However, in some cases optimal alignments are not 
desirable and one is interested in ensembles of proba- 
bilistic alignments. This is particularly the case when 
one wishes to compare so called weak homologs, i.e. se- 
quences that are related on a relatively long evolutionary 
time scale. In the literature some examples can be found, 
where probabilistic alig nments clearly outperform opti- 
mal alignments [HI, Il3l 0, EH • From the physical per- 
spective, a natural generalization to probabilistic align- 
ments can be achieved by introducing a temperature and 
considering canonical ensembles of alignments for each 
pai r of input sequences instead of the ground state alone 
[I2I [Ifl [l7| . The finite temperature approach provides 
also interesting applications when one wishes to assess 
the reliability of alignments by so called posterior prob- 
abilities 0, [3. 

For both approaches, for OA and for finite-temperature 
alignments (FTA), the choice of the algorithmic param- 
eters remains ambiguous. In particular, the choice of 
the so called gap-costs (see below) requires some heuris- 
tic experimentation. Interestingly, this question can be 
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approached by the theory of critical phenomena. The 
study of sequence alignment from that perspective yields 
interesting results that have improved the optimal choice 
of parameters of sequence alignment fl7l . flpL Hoj . The 
linear-logarithmic phase transition [20. |2U l22j| is the most 
important aspect regarding this issue. The name stems 
from the fact that there is a continuous, parameter-driven 
transition between phases where the average score grows 
linearly or logarithmically with sequence length, respec- 
tively. There is much empirical evidence that the opti- 
mal choice of scoring parameters is close to the phase 
boundary on the logarithmic side [HI, [24[ ■ The underly- 
ing reason is that the transition is driven by the balance 
between the score matrix that measures the similarity be- 
tween letters of the underlying alphabet (i.e. amino acids 
in the case of protein alignment) and the gap costs. The 
latter ones control how strong insertions or deletions of 
subsequences are to be penalized. Hence, one would like 
to identify similar regions, which means to try to avoid 
gaps, giving them a penalty. On the other hand, one 
would like to ignore small local evolutionary changes to 
the sequences, which means one should not make the gap 
penalty too strong. This leads to an optimum choice of 
the gap penalty parameters at "intermediate values" . 

At T = 0, i.e. for OA, Hwa and Lassig have studied 
the transition by looking at the dynamic growth of the 
local (an d global) score when advancing in the search 
space [24|. Later, the critical values were studied analyt- 
ically by a self-consistent equation [13] or numerically by 
a finite-size scaling analysis |25j . Both studies rely on a 
simple scoring model with a single mismatch parameter. 
In the latter procedure the problem was approached by 
considering the linear-logarithmic phase transition as a 
percolation phenomenon [26| . 

The aim of our study is to go beyond the models that 
have been considered so far. In particular, we studied 
the most widely used protein alignment model, i.e. lo- 
cal alignment with the scoring matrix blosum62 [27j and 
affine gap costs (see below), where the linear- logarithmic 
phase transition is of actual relevance to the database 
queries or alignment analysis of protein sequences. 

We considered the geometrical interpretation of align- 
ments and studied numerically the percolation proper- 
ties of OA and FTA. This allowed us to determine criti- 
cal exponents that describe the parameter driven linear- 
logarithmic phase transition. Furthermore we deter- 
mined the phase diagram in the plane that is spanned by 
the temperature and the gap costs. Finally, we studied 
the distribution of the optimal score and the free energy 
close to the transitions. 

In the following section we review the model and algo- 
rithms to compute the partition functions and methods 
to sample alignments from the canonical ensemble. The 
main results for different observables and the (free-) en- 
ergy distributions are presented in Sec. IIII1 followed by 
a discussion in Sec. [IVJ 



II. PARTITION FUNCTION CALCULATION 
AND SAMPLING 

An alignment relates letters from one sequence a = 
ai...<2L G S L to a second one b = bi . . . 6m G 
E M where £ denotes the underlying alphabet. Here, 
we consider protein sequences wherein E is given by 
the 20 letter amino acid alphabet. Given the pair a 
and b, an alignment A is an ordered set of pairings 
{(zi, ji), . . . , (i Nm , ]N m )} with 1 < i k < i k+ i < L, 
1 < jk < ik+i < Af. If a ik = b jk the pair (i k ,jk) is called 
a match otherwise a mismatch. Consequently, we will 
refer to N m as the number of matches plus mismatches. 
The state space of all alignments of a and b shall be 
written as Xa,t>- 

When comparing sequences, one has to account for so 
called insertions or deletions of subsequences that occur 
in evolutionary processes. Regarding alignments, these 
processes are represented by gaps, which are defined as 
follows. If ik+i = ifc + 1 and jk+i = jk + 1 + 1 with / > 
and (ik, jk), (ik+i, jk+i) € -4j then b is said to contain a 
gap of length I between jk and jk+i and likewise for the 
sequence a. If ji = I + 1 > 2, then b is said to have a gap 
of length I at the begin, if j^ = M — I < M, then b has 
a gap of length I at the end and likewise for the sequence 
a. 

For the comparison of sequences its relevant to give a 
measure for the similarity or the degree of conservation 
between the sequences or regions of the sequences under 
consideration. The classical way to accomplish this is to 
assign a score for each alignment via an objective func- 
tion S : %a,b — * K and then maximizing S among all 
alignments 

So(a, b) = max<S(.4; a, b) 

A opt = argmax5(^;a,b). (1) 

For the choice of the objective function and its param- 
eters it is necessary to decide 

(i) whether we are interested in a locally conserved re- 
gion or whether the entire sequences should be con- 
sidered, 

(ii) how matches and mismatches should be evaluated, 
and 

(iii) how gaps should influence the overall score oder how 
a gap penalty should affect the overall score. 

To address the first issue there are in principle two 
types of objective functions available, namely optimal lo- 
cal alignment scores Sg ocal and optimal global alignment 
scores Sq ° a . Optimal global alignment scores involve 
contributions from all matches, mismatches and gaps. 
Based on this, the optimal local alignment score is the 
optimum of all global alignments of all possible contigu- 
ous subsequences of a and b, 

S 1 ocal (a,b)= max Sf bai (<v • • • <4, V • • • fy)- (2) 

\<i' <i<L 
\<j'<j<M 
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The second issue requires the knowledge of a relation- 
ship between the letters of the underlying alphabet. This 
is usually realized by substitution or score matrices that 
assign each pair of letters a number a(a,b). Here, we 
use the most frequently used matrix, blosum62 (27j . In 
most cases, scoring matrices are derived from biological 
data by the scaled log-odd ratio where P a b is the 

J a Jb 

probability of observing the pair of letters a and b, f a 
and fb denote the background frequencies of observing 
the letters a and b independently and A defines a scale 
1]. The entries are usually rounded to integers. 

Regarding the gaps, one compromises between a com- 
putational feasible and a biological evident penalty func- 
tion g. That means, each gap T of length lr yields a neg- 
ative contribution of — g(Jr) to the overall score, which is 
then defined as 



S(A; a, b) 



E 



a(ai,bj) 



Esw. ( 3 ) 



g is usually a monotonously increasing function of the gap 
length. The alignment algorithms for gaped alignments 
with arbitrary gap penalties exhibit a cubic time com- 
plexity (0(max(L, M) 2 min(L, M))). In practice affine 
gap cost functions 

g(l T ) = a opcn + (3 cxt (h-l), with a opcn > /3 oxt > 

(4) 

are commonly used, because the computational complex- 
ity then reduces to 0(LM) The parameters a opcn 
and (3 ext are called gap open penalty and gap extension 
penalty respectively. With the above choice one mimics 
the biological observations that 

(i) longer gaps appear less frequently than shorter ones 
and 

(ii) opening a gap is less likely than extending an ex- 
isiting one. 

There is some evidence that this form describes the 
natural process of insertions and deletions quite well 
gl lip. 

An alignment can be represented as a directed path on 
a lattice of size L x M (see Fig. [1]). The path is given 
by the set of matches and mismatches and gaps in the 
alignment A. Due to the conditions ik < ik+i and < 
jk+i the path is directed. By convention, we say that 
each path element may be orientated in (—1, —1), (—1, 0) 
or (0,-1) direction. Diagonal elements denote matches 
or mismatches and lr consecutive vertical or horizontal 
elements correspond to gaps of length lr in one of the 
sequences. 

In order to keep the path representation for local align- 
ment unique we require that 

(i) the first and the last path element always points in 
(—1, —1) direction, hence gaps at the begin and end 
of the alignment never occur, and 




J 



FIG. 1: Representation of an alignment as a directed path 
in quenched disorder. The disorder is realized by random 
sequences. 



(ii) a gap in the sequence b is not allowed to directly 
follow a gap in sequence a (see [Hj]). 

The optimal alignment can be computed by the dy- 
namic programming algorithm (like a transfer matrix 
method) by Smith and Waterman p|. For affine gap 
costs it requires three (L + l) x (M + 1) matrices Dij, Pij 
and Qij that are computed iteratively [28[. The matrix 
element Dij is the optimal local alignment score of the 
subsequences a\ . . . and b\ . . . bj given that a; and bj 
are paired. Pij and Q^j are auxiliary matrices storing 
the optimal alignment score of the subsequences a\ . . . et^ 
and b± . . . bj given that the alignment ends in a gap in 
either sequence. The recursion relation to compute these 
matrices reads as 



Da 



a(a t ,b J ) 




Pij = max 



A 




- a opcn 


Qi- 


-ij 


- a opcn 


P l - 




- (3 cxt 



5»)J 



max 



Aj_i - a open 



with the boundary conditions 

A,o = Pi,o = Qi,o = -oo 
D ,j = Po,j = Qo,j = -oo 



(5) 



for i = . . . L 
for j = . . . M 



In a physical interpretation a(ai,bj) plays the role of a 
random chemical potential with quenched disorder and 
the gap cost function g(lr) describes a line tension that 
forces the alignment path on a straight diagonal line. 

The optimal alignment score, which in physical terms 
corresponds to the negative ground-state energy, is given 
by Sjf cal (a, b) = max{max iJ {D hj } , 0}. If S{, ocal (a, b) = 
0, the optimal alignment is the empty alignment (a path 
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of length 0). Otherwise the alignment starts at the po- 
sition of the maximum of Dj j . The optimal alignment 
(ground state) can be determined by a backtrace pro- 
cedure. Given that the current state at position 
is a match or mismatch, the path is extended in di- 
agonal direction if Dij = a(di,aj) + Di-ij-i and ac- 
cordingly in vertical or horizontal direction if £),y = 
a(a,i, a,j)+Pi-i t j-i or D hj = a(a l7 aj)+Q i ^ij^ 1 . Similar 
conditions appear, if the current state is a gap in either 
sequence. This is repeated until a match/mismatch with 
Di j = a(ai,bj) is met. In the linear phase, the ground 
state might be highly degenerate. For this reason one 
should include additional matrices N-j\N-j\n^ that 
account for the degeneracies of the alignments that end 
at With help of these matrices it is possible to 

sample all ground states uniformly. 

In the picture of DPRM one usually uses a temporal 
and a spatial coordinate which are defined as 

t=^(i+j) and x = hi-j). 

A local alignment problem is seen as a dynamical growth 
process which starts at space-time event (to, xq), where 
the dynamic programming matrix D^j is maximal. In 
each time step described the path is extended by one di- 
agonal, vertical or horizontal element. The spatial vari- 
able x — xq describes the deviation from a straight diago- 
nal line for each time step. The stopping condition Dij = 
a(ai,bj) given that the current state is a match defines 
the final point (t\,xi) in the space-time. We define the 
roughness of the path as the maximal deviation from a 
straight diagonal line, i.e. A = maxt 1 <t<t \x(t) — xq\. 
Note that this definition refers only to the local align- 
ment path and is "time independent" in contrast to Ref. 

M- 

Next, we describe the generalization of the alignment 
problem to a canonical ensemble of alignments. Let us 
consider the canonical ensemble of all alignments A for 
a quenched pair of sequences. The partition function at 
temperature T is given by 

Z T = ex P i S i A ' a M/T}- 

A 

This sum can be computed by a generalization of Eq. ([5]) 

z% = (i + ^ i , h + C 1 ,h + ^h)-^ )/t 
= fa i, + ■ e- Q ° PC " /T + zZ-u ■ *-^ IT 
y-l = ^-i-- QOPCn/T + ^- 1 - e -^ t/T (6) 

with the boundary conditions 

^o = <o = ^o =0 fari = 0...£ 
Z Zi = Z ii = Z ti = forj = 0...M. 

Since an alignment may start anywhere and may also 
include the empty alignment, the full partition function 



is given by 

L M 
i = l j=l 

Note that contributions from Z p and Z® are explicitly 
excluded because they are auxiliary only and contain 
non-canonical alignments. In the limit T — > Eq. © 
reduces to the recursion relation of the original Smith- 
Waterman algorithm Eq. (f5|) . Once the transfer matrices 
Z?j, Zfj and Zf^ are determined it is possible to di- 
rectly draw alignments from the canonical distribution 
Pt [A] = exp[S(^4; a, h)/T]/Zx with zero autocorrela- 
tion. This direct Monte Carlo algorithm was proposed 
by Muckstein et. al. [12| for local alignment. A general 
description of such methods are presented in the text- 
book of Durbin et. al. 



III. RESULTS 

To study properties of the linear-logarithmic phase 
transition, we generated ensembles of n samplc random se- 
quences which were drawn from the distribution P(a) — 

Y[f=i fa,i , where / are the amino acid frequencies that 
were derived together with blosum62 27}. Furthermore 
we only consider pairs of sequences of equal length, i.e. 
L = M, between L = 40 and L — 5120. It turned out 
that for the finite-size-scaling analysis, that is discussed 
in the following, only system sizes with L > 480 yield 
consistent results. The number n 13 ™? 16 of samples varied 
between 6400 for L = 480 and 800 for the largest system. 

For each sample, n allgn = 100 alignments were drawn 
from the canonical ensemble at various temperatures T 
using the backtracing procedure as described above. We 
used different gap-open parameters a open and tempera- 
tures T between T = and T = 4. The gap extension 
parameter f3 cxt was set to 1 throughout. 

The case T = corresponds to optimal alignments 
(ground states). Note that for small gap costs (i.e. in 
the linear phase) the ground-state degeneracy grows ex- 
ponentially with the system size, whereas in the loga- 
rithmic phase usually only a few optimal alignments are 
observed. Thermal averages and averages over ground 
states over a fixed realization of a sequence pairs will be 
denoted as (-)t and (-)o, respectively. Averages over re- 
alizations of random sequence pairs will be written as [•] 
in the following. In statistical mechanics of disordered 
systems the latter one is often called average over the 
disorder. 

So as to describe the linear-logarithmic transition, we 
considered different observables described in the subse- 
quent sections. 
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FIG. 2: Results for the number N m of aligned letters 
(matches plus mismatches) as a function of gap opening 
penalty a opcn . Curves for different sequence lengths intersect 
at the critical parameter a c . For a more clear presentation, 
single data points are only shown for one system size. Inset: 
After rescaling the abscissa with appropriate values for q c and 
v the data listed in Tab. U collapses on a single master curve. 



A. Geometric properties 

Here, we describe the results for the number N m of 
paired letters (matches plus mismatches). This quantity 
turned out to be an adequate quantity to extract proper- 
ties of the phase transition, such as the critical gap costs 
a c (T) and the scaling behavior close to criticality. 

We consider the averaged number [(N m )T] /L of paired 
letters per sequence length as a function of gap opening 
penalty a opcn and temperature T. Hence, it is the frac- 
tion of matches/mismatches with respect to the max- 
imal possible number of pairs. This observable corre- 
sponds to the percolation probability (the probability 
that a geometrical object spans the entire lattice) in stan- 
dard percolation theory. Usually a crossover from 1 to 
is observed when passing the phase boundary. Here, a 
perfectly percolating local alignment N m — L is hardly 
found even in the percolating phase. Nevertheless, we 
applied the same finite-size-scaling analysis as for the 
usual percolation probability, because [(iV TO )r] /L is in 
the order of unity in the linear regime and vanishes in 
the logarithmic regime. 

Fig.[5]displays [(iV m )r] /L as a function of the gap open 
penalty a opcn for different lengths L and zero tempera- 
ture. The curves for different sequence lengths intersect 
at the critical value a c as expected for a second order 
phase transition. Using finite-size-scaling theory [26| , we 
may extrapolate data from finite sequence lengths to the 
thermodynamic limit L — > oo. In this limit the observ- 
able [(-/V m )x] /L is in the order of 1 below the threshold, 
i.e. a opon < a c , and it jumps to exactly at a c . In fi- 
nite systems, L < oo, the crossover extends over a range 
~ L~ x l v as can be seen in Fig. [5J Scaling theory leads 




open 

a 

FIG. 3: Critical fluctuations of N m . The positions of the 
peaks approach the critical value a c the and their heights 
diverge like U'l" . Inset: fit of Xmax(i) to the scaling form 
~ L 7/ ". 

us to expect that the behavior of [{N m ) T } (a opon ;T;L) 
close to criticality is described by 

[<AWt] K pon ; T; L)/L = f ((a opcn - a c )L l ' v ) , (7) 

where / is an universal scaling function and the exponent 
v describes the divergence of the "correlation length" at 
the critical point a opon = a c . We used Eq. {7} to ex- 
tract numerical values for the critical exponents v and the 
critical gap costs a c with high precession from all data 
for a fixed temperature simultaneously. The fit is per- 
formed by minimizing a weighted-x 2 -like objective func- 
tion Q(a c ,v) [32[, that measures the distance (in units 
of the standard error) of the data from the (a prior un- 
known) master curve. For the example in Fig. [21 i.e. 
T = 0, we obtained a c = 8.306(4) and v = 1.58(5) with 
acceptable quality of Q = Q(a c ,v) — 2.2 (Q should be 
in the order of 1 [32|]). Statistical errors have been deter- 
mined by bootstrapping [H, HI] . Repeating this analysis 
for several temperatures one may probe the critical line 
a c (T) (see below). 

We also tested a related quantity, which is defined as 
the number of matches / mismatches plus the number of 
gaped letters N m + N g , which results in the same crit- 
ical exponent and critical point within error bars (not 
shown). Gaps seem to play only a marginal role close to 
the critical point. We observe that the roughness A of 
the alignment path as defined above at the critical point 
diverges only logarithmically with the sequence length. 
This supports the equivalence the description of the de- 
scription of both variants of the this quantity. 

Next, we study the critical fluctuations of N m . We 
define the susceptibility-like quantity \ = ([(-^m)^] — 
[{N m ) T f)/L as a function of a opcn (see Fig.©. 

Close to the critical point x diverges like \ ~ L 1 ^ . In 
order to extract the height of the maxima from x(a open ) 
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(N m - ^)/o T T 



FIG. 4: Rescaled distributions of the observable N m for T = 
0. Inset: scaling analysis of the moments. The first moment 
scales as fii ~ L. Higher order moments increase slower than 
Mfe ~ L k . 



we performed parabolic fits in the form x(a opcn ) = 
—C(a c (L) — a open ) 2 + x ma x(^) for each system size L. 
The exponent 7 itself is determined by a fit of Xmax(i) 
to the scaling form ~ L 1 ^ . For T — 0, we obtain 
7/1/ = 0.95(1). 

One may also use the scaling of a c (L) to determine the 
critical value a c and the critical exponent v as a cross 
check via olq{L) — a c — AL~ V . When restricting the 
range to L > 1280 we obtain v = 1.4(3) for T = which 
agrees within the error bars with the value stated above. 
Furthermore the critical value a c agrees within 1.5 stan- 
dard deviations. We also checked that those values agree 
for FTA, T = 1. 

Alternatively, one can determine 7 from the second 
moment of the averaged distribution [P(N m )] of N m at 
a opcn = a c . Hence, we performed further simulations at 
criticality with a larger sample size (for the largest system 
size, n sam P lc « 2 x 10 4 for T = and n sam P lc = 1.6 X 10 4 
for T > 0) . This allowed us to cross check the value of 7 
and to extract higher moments of the distribution. 

Fig. H displays distributions [P((N m ) T )] for T = 
(similar results have been obtained for T > 0). The dis- 
tributions have been rescaled to zero mean and unit vari- 
ance. In all cases we observe that the first moments /ii 
scale as \i\ ~ L 1+e with e = within the errorbars. For 
the second moment one would expect the scaling behav- 
ior ii 2 = L\ ~ L 1+7/ly . Indeed we find j/v = 0.955(8) 
for OA by a least square fit. This is consistent with the 
numerical value obtained by the height of maxima. The 
third and the fourth moment scale as 113 ~ L 2+73 and 
(14 ~ L 3+7i respectively. For temperatures T < 2 both 
exponents 73 and 74 agree within the statistical errors. 

The resulting critical values a c and critical exponents 
v \ 7j 73:74 are summarized in Tab.|TJ All ratio of expo- 
nents, 7/V, 73 /V and 74/1/, are in the order of 1 and seem 
to increase with the temperature. Note that for a per- 



FIG. 5: Results for FTA. Left: Phase diagram for FTA. The 
linear phase is located below the critical line. Right: critical 
exponents as a function of the temperature. 

fectly one-parameter scaling of the complete distribution 
with fii ~ L one would expect 7/1/ = 73/^ = 74/^ — 
. . . = 1. This property is only approximately fulfilled ac- 
cording to our data. This is shown in Fig. O where also 
the resulting phase diagram is displayed. 

The standard order parameter in percolation problems 
is the relative size of the largest cluster. Since local se- 
quence alignments (and its interpretation as DPRM) ex- 
hibits one spacial dimension, the observable N m /L can 
also be interpreted as order parameter, which is one if the 
alignment covers the entire sequences. The usual finite- 
size-scaling ansatz for the relative size of the largest clus- 
ter order parameter reads as 

l(N m ) T ] (a°P° n ;T;L)/i = L~^°°^ f ((a°P cn - a c )L^) , 

(8) 

where the exponent /3 geo describes the divergence of the 
largest cluster. By comparing this relation with Eq. ([7]) 
we may infer /3 geo = and verify that the scaling relation 
7 + 2/3 gco = dv [2^| (with d— I'm our case) is again only 
approximately fulfilled. We confirmed that (3 — within 
the errorbars by considering (3 as free parameter in the 
finite-size-scaling analysis for N m . 

As mentioned above, the roughness only grows loga- 
rithmically with the system size. This implies that the 
fractal dimension d r of the alignment path equals the 
topological dimension d = 1, which is in agreement with 
the scaling relation d r — d — /3 geo jv in a trivial way. 

B. Energetic properties 

As mentioned above, the size of the largest cluster is 
usually regarded as the order parameter in percolation 
problems. In the non-percolating phase, the size of the 
largest cluster typically grows logarithmically with the 
system size whereas in the percolating phase its cxten- 
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V 


7/1/ 


73/^ 


74/1/ 


T 


0/v 




0.00 


8.306(4) 


1.58(5) 


0.955(8) 


0.920(5) 
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0.947(6) 


0.92(1) 


0.91(1) 
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1.00 
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1.55(2) 
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0.91(1) 


1.00 


0.6443(11) 


0.387(6) 


1.33 


9.339(3) 


1.53(2) 


0.948(8) 


0.92(1) 


0.91(1) 


1.33 


0.6564(09) 


0.380(9) 


2.00 


10.791(3) 


1.51(2) 


0.931(9) 


0.92(1) 


0.89(1) 


2.00 


0.6835(10) 


0.36(1) 


2.50 


12.296(4) 


1.50(4) 


0.921(9) 


0.92(1) 


0.88(1) 


2.50 


0.7081(08) 


0.33(1) 


3.50 


16.227(1) 


1.46(1) 


0.87(1) 


0.870(8) 


0.80(1) 


3.50 


0.7691(06) 


0.23(2) 


4.00 


18.557(2) 


1.38(5) 


0.84(2) 


0.924(5) 


0.76(2) 


4.00 


0.7958(05) 


0.20(2) 



TABLE I: Critical gap open penalty a c and critical exponents 
v, 7, 73,74 for the observable N m . 



CO 

V 



0.01 



7- 

: > 6.5 
? 6 

: V _j5.5 


T = 


— L=5120 

— L=3840 

— ■ L=2560 

— - L=1920 
-- L=1280 
--- L=960 

L=640 

x— * L=480 


A 5- 






\; 




: ^4.5 






4- 


-40 -20 20 
, open . . 1/v 

(a -a c ) L 









a 



open 



10 



FIG. 6: 
length. 



Finite-size-scaling analysis for the average score per 



TABLE II: Critical exponents for the average score / free 
energy per length. j3/v was obtained from finite-size scaling 
(see Fig. [6j and cross checked via the scaling of the first 
moment of the score distribution at criticality a = a c . The 
exponent ui describe the fluctuations of the score distribution 
at a = a c . 



0.0001 
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(S - n)/o 






L=5120 





L=2560 
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L=1280 


t> 


L=640 




sion is comparable to the system size [26j. The aver- 
age score of local alignments exhibits the same crossover 
when crossing the linear-logarithmic boundary. For this 
reason we regard the average score [S] /L (OA), or free 
energy [Ft] IL (FTA) , per length as a second order pa- 
rameter, as in [25|. Note that there is no direct geomet- 
rical interpretation for this quantity. 

Scaling theory states that the order parameter scales 

as 

[F T ] IL = L-V v f ((a - ajL 1 /") (9) 

with some universal scaling function /. This allows one to 
extract the critical value a c and exponents v and (3 from 
the data with the same method as described above. Here, 
we fixed a c and v with the values that have been obtained 
from the data collapse for the observable N m and regard 
P/v as a free parameter. The result for T = is shown in 
Fig. [S] The quality of these fits varied between Q = 1.58 
for T = 1.00 and Q = 7.49 for T = 4.00. 

By regarding v and (3 as free parameters, we also used 
the scaling form Eq. ^ as a second cross check for the 
exponent v. Within the error bars it is comparable with 
the results of the finite-size-scaling analysis for the ob- 



FIG. 7: Rescaled score distributions T = 0. The free-energy 
distributions of FTA look comparable. 

servable N m (for example v — 1.50(7) for T = 0). For 
larger temperatures (T > 2) only system sizes L > 1280 
led to convincing results for this kind of check. 

As can be seen in the second column of tabic Tab. HT1 
the free energy per length [Ft] /L decreases like ~ L^ 13 ^, 
where (3/v increases monotonously with the temperature 
from 0.6324(7) to 0.7958(5). For small temperatures, i.e. 
T < 2, the exponent /3 w 1 is not temperature dependent. 
Hence the phase behavior regarding the exponent j3 is not 
universal any more when exceeding T = 2. 



C. Free-energy distributions close to criticality 

In analogy to the distribution of the observable N m 
in Fig. 21 the resulting rescaled score distributions right 
at a opcn = a c are shown in Fig. [7] Simulations for 
FTA yield comparable results. We performed an analy- 
sis of the moments of P[S] and P[F] respectively. The 
first moment scales as fj,\ ~ L 1 ~^l v as expected from the 
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so 


0.00 


0.2966(4) 


3.182(1) 


37.4(1) 


1.00 


0.2924(1) 


2.900(5) 


24.6(1) 


2.00 


0.2907(2) 


3.122(7) 


31.56(6) 


2.50 


0.2980(2) 


3.16(1) 


38.29(7) 
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TABLE III: Fit parameters of least x 2 -fits of the free-energy 
distributions to the modified Gumbel distribution Eq. {TT} for 
L = 120 in the logarithmic phase. 



finite-size analysis shown in Fig. [6l We checked that the 
fit parameters agree with those from finite-size scaling. 

Regarding the second moment, no divergence was ob- 
served. Its scaling behavior is given by /12 ~ L 1 ^^ with 
lu > 0. The resulting fit parameters are listed in Tab. 
OH In the limit 

a open _j. ^ anc i £ — > 00, the score distri- 
bution is predicted to follow a Gumbel distribution 

Pgambe\(S) = A exp [-X(S - S ) - exp(-A(S - S ))] 



according to the Karlin-Altschul-Dembo theory [3£ 

In the linear phase the conditions of this theory are 
not valid any more. Interestingly, right at the critical 
point, the shape of the distributions are well described 
by a Gumbel distribution, at least in the high probability 
region (down to P[S] ~ 10~ 4 ). In previous studies we 
observed parabolic corrections to Eq. (fTOf that occur in 
the far right tail of the optimal score distribution [38L [39l] . 
The corrected distribution is empirically well described 
by 



1 



P(S) = -Pgumbei(S)exp [-A 2 (S - So) 2 ] , 



(11) 



where A2 is a correction parameter and the normaliza- 
tion constant z' is indistinguishable from 1. We found 
evidence that A2 vanishes for L — > 00 but persists even 



for gapless alignment a 



open 



00 138 



Here, we extend this study to finite-temperature align- 
ment, i.e. to the free-energy distribution as a general- 
ization of the score distribution. We employed gener- 
alized ensemble Monte-Carlo simulations combined with 
Wang-Landau sampling in the sequence space (de- 
tails can be found elsewhere [H[). The (production) run 
for L = 120 employed 4.8 x 10 7 Monte Carlo steps for 
each distribution over the disorder. 

In the following, we use the phase diagram as a guide 
to study the free-energy distribution for various tempera- 
tures. We kept the gap-costs fixed (a°P° n = 12, /3 cxt = 1) 
and only varied the temperature (between T = and 
T = 5). The interpolating points are indicated by stars 
in the phase diagram in Fig. O 

In the logarithmic regime (T — 0, 1, 2, 2.5) the free- 
energy distribution is well described by the modified 
Gumbel distribution Eq. (JTTJ) (see Fig. |5J|. Note that 
we have again rescaled the distributions to unit variance 
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FIG. 8: Rescaled free-energy distribution of finite- 
temperature alignments. At T = 2.50 and below, the data is 
well described by a modified Gumbel distribution. For large 
temperatures an exponential tail is observed. 
Inset: The same data shown with a linear ordinate. In the 
high probability region the data for T — 5.00 is well described 
by a Gaussian distribution. 



and zero mean. The fit parameters only change slightly 
with the temperature (see Tab. IIIip . 

The crossover from the logarithmic to the linear regime 
comes along with a change of the skewness, as can be 
seen in the inset of Fig. [8] In the high probability re- 
gion, for a large value T = 5.00, a Gaussian distribu- 
tion describes the data well. This was confirmed by a 
Kolmogorov-Smirnov test that yielded a p- value of 0.14. 
For T = 1/0.275 « 3.64 the evidence for a Gaussian dis- 
tribution is much smaller (a p-value of 2 x 10~ n ). We 
also checked that the change of the shape is accompanied 
by a change from logarithmic to linear growth of typical 
free energies (the position of the maximum) with the se- 
quence length, i.e. the free energy becomes extensive (not 
shown here). This result can be understood in the fol- 
lowing way: The partition functions that appears in the 
transfer matrix calculation Eq. ([5]) become (more or less) 
independent and hence factorize in the linear phase. The 
total free energy decomposes into a sum of independent 
contributions and the central limit theorem applies. 

When considering the rare-event tail at higher temper- 
atures, the free-energy distribution is rather exponential 
than Gaussian, as can be seen in the main plot of Fig. 
[5] Hence, we observe a crossover from a Gaussian dis- 
tribution in the high probability region to the character- 
istic exponential tail of the Gumbel distribution. With 
the same argumentation as for the optimal alignment, 
sequence pairs appearing in the tail feature high similar- 
ities. The overall free energy is dominated by the ground 
state. This was confirmed by looking at the difference 
between the free energy and the ground-state energy for 
those sequences that occur in the tail of the distribution. 
The summation in the transfer matrix are virtually re- 
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placed by maximization yielding an exponential tail. The 
finite-size effect, that is responsible for the curvature of 
the optimal alignment statistics seems to be of marginal 
order in this case. 



IV. SUMMARY 

We presented a finite-size-scaling analysis of the linear- 
logarithmic phase transition of finite-temperature protein 
sequence alignment. This phase transition is crucial to 
determine the set of parameters where the alignment is 
of highest sensitivity. We have used the blosum62 scor- 
ing matrix together with affine gap costs, which is the 
most frequently used scoring system for actual database 
queries. This goes much beyond previous studies, which 
have investigated only simple scoring systems. 

Two order parameters were studied in detail: the num- 
ber of matches (i.e. the alignment length) and the aver- 
age free energy per length. We have analyzed the phase 
transition using finite-size scaling techniques. Using so- 
phisticated algorithms, large systems could be studied, 
such that corrections to finite-size scaling are negligible. 
The resulting critical line a c (T) in the range T = . . . 4 
provides a guide for biological applications where subop- 



timal alignments play an important role. 

Numerical values of the critical exponents v, 7 and j3 
suggest that the percolation transition is not universal 
with respect to different temperature values. 

The free-energy distribution, which can be seen as a 
generalization of the score distribution over random se- 
quences, crosses over from a modified Gumbel distribu- 
tion with a parabolic correction in the tail given by Eq. 
(fTT]) in the logarithmic phase to a modified Gaussian dis- 
tribution with a linear rare-event tail in the linear phase. 
This is another example showing that the large-deviation 
properties of order-parameter distributions change signif- 
icantly close to phase transitions. 
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